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Abstract 

To  solve  directly  a  sparse,  unsymmetric  matrix  equation  Ax  =  b,  an 
equation-ordering  algorithm  based  on  local  equation  decoupling  is  proposed 
to  maintain  a  high  flout  rate  of  scalar  computations  utithin  a  floating  point 
pipeline.  Software  is  described  to  solve  highly-sparse  unpatterned  systems 
efficiently  via  explicit  code  generation.  Rates  in  the  range  of  15  MFLOPS 

on  the  CRAY-1  are  achieved. 
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I.  Introduction 

Vector  processors  have  forced  a  reconsideration  of  traditional  compu¬ 
tational  algorithms.  In  the  solution  of  sparse  systems  of  equations,  such 
study  has  resulted  in  a  proliferation  of  methods  to  service  a  variety  of 
sparsity  characteristics. 

Early  work  in  general  vectorized  sparse  methods  C13,  yielding  codes  that 
appeared  to  the  user  similar  to  "tra ditional"  scalar  counterparts  C23,  had 
limited  intelligence  to  identify  vector  operations.  For  important  classes 
of  both  highly-sparse  and  relatively  dense  systems,  special  codes  were  later 
found  to  achieve  speedups  of  3:1  to  20:1  over  the  general  vector  code.  From 
an  algorithmic  viewpoint,  it  appears  that  the  notion  of  a  general  sparsity 
code  for  vector  architectures  may  be  an  anachronism  (however,  see  C173). 

An  exception  occurs  where  such  speeddowns  can  be  tolerated  in  the 
interest  of  user  convenience;  for  example,  in  a  small  highly-sparse  system, 
the  equation  solution  time  may  be  a  small  fraction  of  the  equation- 
formulation  time  and  such  inefficiencies  may  be  acceptable. 

In  general,  such  speedups  are  achieved  either  by 

(a)  locally  decoupling  of  equations  so  that  the  pipelines  can  be 
"crammed"  with  independent  computations  associated  with  uncou¬ 
pled  equations;  such  methods  are  useful  in  highly-sparse  systems; 

(b)  local  coupling  of  equations  so  that  vectors  can  be  defined  within 
dense  banded  or  locally  blocked  sparse  systems. 

Figure  1  illustrates  that  each  of  these  approaches  can  be  further  clas¬ 
sified.  In  the  case  of  dense  systems,  usually  associated  with  elliptic  fin¬ 
ite  element  and  finite  difference  problems,  coupling  is  exploited  either  (a> 
within  a  grid  point  (node)  -  with  many  unknowns/node  C33  -  or  within  a  fin¬ 
ite  element  C43  or  (b)  across  grid  points,  yielding  banded  and  profile  systems 
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C53.  These  are  termed  intranodal  (intra-element),  and  internodal  (inter¬ 
element)  coupling,  respectively.  In  general,  internodal  coupling  yields 
denser  submatrices,  longer  vectors,  and  so  higher  execution  rates. 

When  highly-spa  rse  equations  are  decoupled,  a  distinction  must  be  made 
between  patterned  and  unpatterned  systems. 

In  the  former,  it  is  presumed  that  (a)  submatrices  with  identical  spar¬ 
sity  patterns  can  be  identified,  and  (b)  these  submatrices  are  stored  with 
similarly-positioned  elements  a  constant  stride  apart.  (This  latter  res¬ 
triction  is  more  important  for  highly  sparse  systems,  where  one  cannot 
afford  to  remap  the  matrix  by  gather/scatter  operations;  however,  the 
existence  of  patterns  usually  implies  that  subsystem  matrices  can  be 
simultaneously  formulated  in  vector  mode  so  that  (b)  is  often  satisfied.) 
These  conditions  apply,  for  example,  to  large  electronic  circuit  matrices 
C73C83  and  assure  vectorizability. 

The  above  vectorizable  dense  and  patterned  sparse  matrix  cases 
account  for  the  majority  of  sparse  problems.  Indeed,  sparse  matrices 
become  large  usually  by  means  of  a  formulation  algorithm  that  guarantees 
vectorizability. 

However,  there  do  exist  relatively  small  (  <  5000  equations)  highly- 
sparse  problems  with  undiscernable  patterns:  some  electronic  circuits, 
electrical  power  systems,  small  dissected  2-D  finite  element  grids  C63, 
occurring  perhaps  as  a  part  of  a  3D  iterative  solution.  In  more  exotic  for¬ 
mulations,  an  unpatterned  matrix  may  represent  only  part  of  a  large 
sparse  system  171.  In  any  case,  such  structures  pose  a  most  difficult  algo¬ 
rithmic  challenge,  apart  from  their  arguable  utility.  It  is  to  these  prob¬ 
lems  that  the  report  is  addressed.  The  results  of  the  report  were  first 
given  in  C143  and  Cl  53. 
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One  additional  caveat  is  offered  before  proceeding.  It  will  be  neces¬ 
sary  to  preprocess  the  sparsity  structure  before  the  numerical  solution  is 
carried  out.  On  the  CRAY-1,  this  preprocessing  time  is  several  hundred 
times  the  matrix  solution  time.  Therefore,  this  procedure  is  appropriate 
only  when  multiple  numerical  solutions  are  required  with  the  same  spar¬ 
sity  structure. 


II.  Algorithms 
A.  Parallel  Solution 

Consider  an  unsymmetric  matrix  equation  of  the  form 


Ax  =*  b 

where  A  is  an  nxn  matrix  and  x  and  b  are  nxl  vectors.  This  equation  is  to  be 
solved  by  LU  factorization,  viz, 

1.  Factor  A  =  LU,  where  L  and  U  are  lower  and  upper  triangular 
matrices,  respectively. 

2.  Solve  Uy  3  b  for  y  (forward  substitution). 


3.  Solve  Ux  »  y  for  x  (backward  substitution). 

The  matrix  A  is  considered  locally  decoupled  if  the  combined  structure 
of  its  LU  factors  has  the  form  till. 
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where  D  _  is  a  diagonal  matrix  and  L  .  t  and  U  4  extend  from  D  to  the 

~r  r+l,T  f  |T+  I  TT 

matrix  boundary.  The  ordered  steps  to  reduce  the  rth  pivot  block  and  the 
associated  right  band  side  components  are 


Drr 

<- 

D~  1 

TT 

(reciprocation) 

(1) 

*"r+  l,r 

<- 

L  D“ 1 

r+l,r*rr 

(multiplication) 

(2) 

r  +  l,r+ 1 

<- 

Ar  +  1,t+  l-^r+l,T^r,r+l 

(mult./ subtract  ion) 

(3) 

Yr+1 

<- 

Y  -L  Y 

r+1  r+l,rr 

(mult./subtraction) 

(4) 

where  T+ t  represents  the  unreduced  southeast  corner  of  t  he  matrix 

and  Y^+  ^  is  the  associated  right  hand  side  at  the  rth  reduction  step.  The 
block  back  substitution  has  the  form 


VUr,r+lVl 


<-  YD 

r  rr 


<5> 

(6) 


where  «T  is  the  rth  block  component  of  the  solution  vector.  Equations  (1)- 

<4)  can  be  performed  in  three  parallel  steps.  That  is,  except  for  the  sub¬ 
traction,  all  right  hand  side  matrix  elements  -  operands  of  unary  and  binary 
floating  point  computations  -  are  known  on  entry  to  the  step.  (The  subtrac¬ 
tions  can  be  processed  efficiently  at  the  coding  level  but  can  not  always  be 
performed  in  parallel.)  Indeed,  the  sparser  the  equations,  the  greater  the 
decoupling  (  dimension  of  D  )  and  the  more  parallel  the  solution. 

rr 
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B.  Pipelined  Sola tion 

Calculation  of  vector  or  scalar  results  in  a  pipeline  requires  that 
operations  in  the  pipeline  at  any  time  be  independent.  Without  this 
independence,  results  must  be  secured  in  registers  or,  morse,  main  memory, 
before  they  can  be  operands  for  a  succeeding  computation.  Independence 
ideally  permits  pipelines  to  be  crammed  with  vector  or  scalar  operands. 
Thus,  parallel  and  pipelined  architectures  make  a  similar  demand  on  the 
organization  of  an  efficient  solution  algorithm!  equations  <l>-<6>  are, 
therefore,  also  the  basis  of  the  proposed  pipelined  solution. 

C.  Code  generation 

If  the  elements  of  the  0  are  stored  a  fixed  address  increment  apart, 

TT 

then  conceivably  floating  point  operations  could  be  performed  in  vector 
mode.  However,  assuming  column-ordered  matrix  storage  for  compatibility 
with  existing  programs,  the  cost  of  gathering  the  diagonals  into  this  vector 
storage  format  will  likely  not  be  worth  the  advantage  of  vectorization. 
Similar  arguments  apply  to  the  other  two  highly  sparse  parallel  steps.  For 
the  CRAY-1,  with  slow  gather/scatter  operations,  it  is  assumed  that  float¬ 
ing  point  operations  should  instead  be  performed  in  scalar  mode. 

To  achieve  the  highest  speed  scalar  operation,  it  was  decided  to  gen¬ 
erate  explicit  loopless  scalar  code  in  the  manner  of  Gustavson  C93.  This 
avoids  the  issuing  of  address  operations  -  costly  on  the  CRAY-1  -  since 
the  addresses  are  imbedded  in  the  scalar  code.  Thus,  when  a  series  of  con¬ 
secutive  reciprocations,  (or  multiplications,  or  subtractions)  is  to  be 
performed,  the  code  generator  produces,  in  a  preprocessing  step,  a  sequence 
of  similar  scalar  operations  with  different  addresses.  Because  the 
instructions  are  identical  except  for  addresses,  the  associated  scalar 
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fetches,  floating  point  operations,  and  stores  can  be  overlapped  in  a 
predictable  manner. 

Table  1  gives  a  summary  of  the  asymptotic  rates  of  each  of  these  gen¬ 
erated  code  sequences,  with  and  without  overlapping.  Since  the 
multiplies/subtracts  dominate  the  other  computation  in  any  but  the  spar¬ 
sest  matrix,  the  execution  rate  for  a  large  matrix  should  approach  that  of 
the  multiply/subtract  kernel  (14.8  MFLOPS).  The  detailed  kernel  overlapped 
timings  from  a  CRAY-1  simulator  Cl 03  are  given  in  Table  2. 

D.  Ordering 

The  restricted  utility  of  this  class  of  sparse  matrix  algorithms  to 
small  highly  sparse  systems  suggests  that  available  ordering  techniques 
and  software  be  modified,  rather  than  new  software  be  developed.  For  this 
reason,  the  following  procedure  represents  a  variation  on  the  so-called 
minimum-degree  algorithm,  but  applied  to  unsymmetric  matrices.  It  is 
accepted  apriori  that  specialized  ordering  software  may  execute  more  effi¬ 
ciently. 

First,  the  conventional  MD  minimizing  algorithm  is  reviewed.  At  the  kth 
step  in  the  ordering,  let  pr(m)  and  pc(m),  m  *  l,..n,  represent  the  row-  and 
column-ordering  permutation  vectors,  with  pr(m)  =  pc(m)  *  m  for  k  *  1. 
Also,  let  and  n^C(j)  b®  the  number  of  k+l,..n  and  j  ■  k,k+l,...n,  respec¬ 
tively.  There,  among  the  non-zero  elements  e  ...  , ..,  the  pivot  positions 

pru 

and  are  chosen  such  that 

"  <l»ii:min  <%T(i)- 1  )(npc(  1  ^  e>T(i),K( j,  *  0;  k<i<n,k<  j<«> 

The  modified  algorithm  insures  the  local  decoupling  of  equations  and 
variables,  as  follows.  When  pivot  positions  and  are  selected  within 
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block  r,  the  routs  and  columns  coupled  to  the  block’s  pivot  positions  are 
marked;  pivoting  is  then  prohibited  on  non-zeros  in  these  routs  and  columns. 
When  no  non-zero,  unmarked  pivots  exist,  a  neut  block  is  initiated  by  incre¬ 
menting  r  and  clearing  markers. 

E.  Limited  Decoupling 

The  memory  hierarchy  of  the  CRAY-1  suggests  that  the  full  decoupling 
allowed  by  this  ordering  should  not  be  exploited.  It  is  preferred  that  the 
results  from  the  first  two  steps  of  (2)  and  <3)  be  maintained  in  64  scalar 
(T)  registers.  This  necessitates  that  the  total  number  of  elements  in  D 

TT 

and  L  be  no  greater  than  64,  since  all  elements  of  these  two  matrices 

T+  X  fT 

are  required  in  (3)  and  (4).  By  correspondingly  limiting  the  dimension  Df 
Din  the  ordering  algorithm,  the  minimal  degree  criterion  is,  on  the  aver- 

age,  less  constrained  during  a  pivot  selection  than  if  maximum  decoupling 
were  demanded  within  each  block.  In  the  limit,  if  the  dimension  of  D _ is 

constrained  to  be  unity,  a  true  HD  criterion  results  and  the  MD  operation 
count  should  be  achieved. 

Viewed  another  way,  since  the  scalar  register  file  size  is  limited  at  64, 
a  family  of  matrices  increasing  in  size  should  be  less  impacted  by  the 
limited-decoupling  strategy  as  the  size  increases.  Thus  a  matrix  of  large 
dimension  should  achieve  a  nearly  minimal  (HD)  operation  count. 

A  flow  chart  of  the  limited  decoupling  algorithm  is  shown  in  Figure  3. 
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III.  Software 

A.  Introduction 

The  program  has  two  parts: 

(1)  A  Fortran  symbolic  preprocessor  that  (a)  orders  the  equations 
according  to  the  limited  decoupling  algorithms,  (b)  generates  CRAY-1 
machine  code  in  a  buffer  array,  and  (c)  writes  this  code  into  a  file 
in  unformatted  form. 

(2)  A  program  that  <a)  reads  the  code  into  main  memory,  (b)  formu¬ 
lates  a  set  of  equations  of  the  prescribed  sparsity  from  random¬ 
valued  numerical  data,  and  (c)  calls  (from  Fortran)  a  short  inter¬ 
face  program  that  jumps  to  the  code. 

The  flow  chart  is  given  in  Figure  2.  Note  that  the  same  code  suffices  to 
solve  multiple  numeric  solutions. 

B.  Inputs  To  Symbolic  Phase 

The  symbolic  phase  reads  the  following  data. 

1.  N  -  the  number  of  equations. 

2.  NRMAP  *  0  if  numeric  values  stored  in  column  order!  NRMAP  =  1  if 
order  of  numeric  values  is  given  in  NUMN 

3.  JA  -  an  array  of  dimension  N+lJ  JA(J)  points  to  the  first  ele¬ 
ment  of  the  Jth  column  in  array  I  A;  UA(N+  1 )  points  to  one  beyond  the 
last  element  of  IA. 

4.  IA  -  an  array  of  dimension  NA  =  JA(N+1)-1,  containing  the 
column-ordered  row  indices. 

5.  NUMN  -  an  array,  usually  of  dimension  NA;  NUMN(J)  gives  the 
location  in  data  array  A  of  the  element  corresponding  to  IA(J). 
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Although  written  as  a  self-contained  research  program  (e.g.,  facilities 
are  provided  to  generate  randomly-positioned  matrices,  to  count  opera¬ 
tions,  and  produce  a  printer  map),  it  is  relatively  clear  which  facilities 
should  be  deleted  for  production  use.  Also,  the  above  data  could  be 
transferred  in  an  argument  list. 

If  NRMAP  =  0,  it  is  assumed  that  NUMN(J>  =  J,  and  NUM(J>  is  not  refer¬ 
enced  further;  the  dimension  of  NUM  need  then  be  only  unity. 

C.  Numeric  Solution  Phase 

A  Fortran  test  driver  (Table  3)  was  developed  to  formulate  a 
randomly-valued  matrix  of  the  sparsity  prescribed  by  JA  and  IA.  Moreover, 
the  values  are  mapped  according  to  NUMN  and,  to  insure  numerical  domi¬ 
nance  of  the  pivot  positions,  pivots  are  located  from  an  array  passed  from 
the  symbolic  program.  The  right  band  side  is  formulated  so  that  the  solu¬ 
tion  vector  X  has  the  value  X(J)  =  J  +  1. 

The  linkage  in  the  code  that  performs  the  LU  factorization,  the  forward 
substitution,  and  the  back  substitution  is  made  by  the  subroutine  invoca¬ 
tion 

CALL  EX  EC  (INST,  A,  B,  X,  N) 

where 

INST  is  an  array  containing  the  machine  code 

A  is  the  matrix  numeric  values,  packed  according  to  NUMN  and  IA 
B  is  the  right  hand  side 
X  is  the  solution 
N  is  the  number  of  equations. 


All  reciprocations  are  half  precision. 


D.  Ordering  Su brou tine 

The  equation-ordering  program,  being  a  critical  element  of  this 
software  package,  deserves  separate  documentation.  A  list  of  its  calling 
arguments  follows. 


CALL  SPCPIV(N,  NA,  IA,  JA,  IROW,  JCOL, 

JROW,  ICOL,  INUM,  JNUM,  IMIN, 
JMIN,  IPR,  IPC,  ISIZE,  IBLC, 
IBLR,  IBLOCK,  IMINT,  JMINT, 
IMAP,  ICALC,  NMAP,  NBL,  IPIV, 
ICMAX,  IDP). 

where 


N* 

NA* 

NMAP 

NBL 

ICMAX* 

ISIZE* 

IPIV* 

IDP* 

IA< J)* 

IBLOCK(J) 


is  the  number  of  equations 
is  the  number  of  non-zero  elements 
is  the  number  of  elements  of  MAP 
is  the  expected  number  of  diagonal  blocks;  .LE.N 
must  be  set  to  64  by  user 

is  the  maximum  expected  number  of  non-zeros  of  L  and  U  (combined) 

*0  if  limited  decoupling  is  desired 

=  1  if  no  decoupling  is  desired!  MD  ordering  criterion  is  then  used 

=»0  for  unspecified  tie-breaking  in  MD  ordering 
*1  for  diagonal  preference  in  tie-breaking 

contains  column-ordered  row  number  of  non-zero  positions 
of  matrix;  dimension  is  NA 

is  the  row  nu  mber  <=  column  number)  of  first  element 

of  Jth  diagonal  block;  dimension  at  least  NBL+  1 ;  IBLOCK  <N+  1 )  *=  N8L+1 
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(The  following  arrays  mu  st  have  a  dimension  ofat  least  N  or  N+l) 

JA(J)*  is  the  location  in  JA  of  the  first  non-zero  element 
in  the  Jth  colu  mni  JA(N+  l )  =  NA+1 

IPR(J)  is  the  Jth  pivot  row  number 

IPC<J)  is  the  Jth  pivot  column  number 

ICALC(J)  is  the  number  of  floating  point  operations  to  factor 
the  matrix  through  the  Jth  column 

JNUM,  INUM,  IMINT,  JMINT,  IBLR,  IELC,  IMIN,  JMIN,  JROW,  ICOL 
are  working  arrays 

The  following  arrays  must  have  a  dimension  equal  to  the  number  of  expected 
non- zeros  of  L  and  U  combined,  plus  N,  the  number  of  equations. 

IROW  and  JCOL  are  working  arrays 

IMAP  contains  information  related  to  the  map  of  L  and  U  in 

alternating  row-  and  column-order;  diagonal  elements  are 
represented  twice,  requiring  N  additional  locations. 

IV.  Performance 
A.  Choice  of  examples 

Because  the  dimension  of  the  matrix  is  limited  by  the  size  of  code 
stored  in  main  memory,  the  number  of  applications  of  this  procedure  is  lim¬ 
ited.  On  the  other  hand,  within  this  class  of  highly-sparse  systems,  the  code 
length  and  other  performance  aspects  appear  to  be  relatively  sensitive  to 
sparsity  features  from  different  applications.  Therefore,  illustrative 
problems  have  been  chosen  from  a  number  of  applications,  namely, 

1.  Electronic  circuit  analysis, 


♦Input  data  to  subroutine. 
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2.  General  elliptic  PDE  solution  (by  nested  dissection), 

3.  Oil  reservoir  analysis, 

4.  Electrical  power  systems  analysis. 

The  first  class  of  problems  is  unsymmetrical  in  structure!  the  latter 
three  classes  are  symmetrical  in  structure,  but  are  assumed,  for  purposes 
of  this  study,  unsymmetrical  in  value.  In  all  cases,  off-diagonal  pivoting 
is  allowed. 

B.  Effect  of  ordering 

It  is  well-known  that  the  operation  counts  associated  with  the  order¬ 
ings  of  highly-sparse  matrices  are  sensitive  to  the  tie-breaking  procedure. 
The  current  ordering  algorithm  is  not  necessarily  optimized  in  this  respect 
(see  Cl 21  and  C13I).  However,  two  options  have  been  incorporated  in  the  pro¬ 
gram: 

(a)  choosing  the  "first-found"  tied  pivot,  and 

(b)  preference  for  diagonal  pivots. 

In  general,  it  has  been  found  desirable  for  symmetrically-structured 
matrices  to  favor  diagonal  pivots.  Unsymmetrically-structu  red  matrices 
yield  mixed  results. 

Floating-point  operation  counts  for  a  number  of  problems  are  given  in 
Table  3,  with  MD  ordering  and  with  the  limited  decoupling  algorithm.  The 
penalty  incurred  for  decoupling  is  moderate  and  decreases  on  a  fractional 
basis  as  the  matrix  size  increases  (as  previously  predicted).  These  results 
are  not  surprising,  since  in  many  model  finite  difference  problems  C63, 
minimal  operation  counts  are  associated  with  the  decoupling  proposed  here. 


C.  Code  length  and  performance 

Table  4  gives  the  timing  and  storage  results  of  a  number  of  problems. 
The  "effective"  MFLOPS  are  the  actual  MFLOPS  multiplied  by  the  degrada¬ 
tion  due  to  the  extra  floating  point  computations  necessitated  by  forced 
decoupling,  visa  vis  MD  ordering  (see  Table  3). 

The  following  should  be  noted. 

(1)  The  code  length  is  approximately  equal  to  the  number  of  floating 
point  operations,  in  64-bit  words.  This  allows  one  to  estimate  the 
feasibility  of  code  generation  for  a  problem  with  a  krown  complex¬ 
ity.  For  example,  from  Table  4,  a  million-word  memory  would  seem 
to  be  adequate  to  store  code  for  the  largest  real-valued  electrical 
power  system  problem  and,  perhaps,  a  1000-equation  complex-valued 
system.  Electronic  circuits  in  the  range  of  5000  equations  should 
be  readily  handled.  Five-point  2-D  square  finite  difference  grids 
solved  by  nested  dissection  C63  have  by  a  known  solution  complexity 
of  %  20  n3;  these  can  be  solved  for  m  436. 

(2)  The  code  generation  time,  exclusive  of  writing  the  code  to  a 
file,  is  approximately  18  nsec  per  floating-point  operation,  or 
200-400  times  the  equation  solution  time.  Together  with  the 
storage  results  above,  approximately  18  second*,  suffice  to  gen¬ 
erate  a  million  words  of  code.  In  general,  the  code  generation  time 
is  less  than  the  equ ation-ordering  time  for  highly-sparse  problems} 
denser  matrices,  such  as  those  associated  with  0-4  ordered  reser¬ 
voir  grids,  have  the  opposite  relation. 

<3)  The  execution  rates  (MFLOPS)  is  relatively  insensitive  to 
variations  in  the  matrix  size  and  density.  For  example,  the  highly- 
sparse  power  system  and  electronic  circuit  matrices  yield  rates  in 
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the  range  of  11.6-15  MFLOPS,  whereas  the  denser  grid-related 
matrices  can  be  solved  at  16-18  MFLQPS.  (Of  course,  model  grid 
problems  can  be  solved  at  for  higher  rates  by  band-related  methods 
C161).  This  insensitivity  is  due  to  the  independent  (parallel) 
element-level  operations  that  are  associated  both  with  dense 
matrices  and  with  decoupled  sparse  matrices. 


It  is  reasonable  to  conclude  that  11  MFLQPS  represents  a  Lower  bound 
of  the  solution  rate  of  any  sparse  matrix  requiring  fewer  than  one  million 
floating  point  operations. 
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Figure  1.  Classification  of  sparse  matrix  vectorized 
algorithms  and  CRAY-1  available  software 
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Figure  3.  Flow  chart  of  limited-decoupling  ordering  algorithm) 
(A)  represents  coding  to  process  case  when  more  than 
non-zeros  are  in  a  column 
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Inner  loop  timinox 

Execution  rate 

(Clocks ) 

(MFLOPS) 

Kernel 
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2.94 

10.7 

Multiplication 

23.9 

7.25 

3.36 

11.0 

Multi. /Subt. 

29.5 

10.8 

5.42 

1 4.B 

xTiming  includes  instruction  fetching. 

•  Half-precision. 

♦  Result  store  overlapped  with  next  operand  fetch. 

Table  1.  CRAY-1  kernel  performance 
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Table  2 (b) .  Multiply  kernel  activity  report 
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GENERAL  TEST  DRIVER  FOR  NUMERIC  SOLVER  _  r.rtII,<rTrt„_ 

THE  DIMENSION  OF  THE  FOLLOWING  SHOULD  BE  .GE.  #  OF  EQUATIONS  +1 
DIMENSION  SUMR(  801 ) ,  SUMC(801),  IPIV(801),  B(801),  X(801),  JA(801) 
C  *«»  THE  DIMENSION  OF  THE  FOLLOWING  SHOULD  BE  .GE.  fi  OF  NONZEROS  OF 
C  MATRIX 

DIMENSION  IA(9000) 

THE  DIMENSION  OF  THE  FOLLOWING  SHOULD  BE  .GE.  #  OF  NONZEROS  OF  LU 
DIMENSION  NUMN( 10000),  A( 10000) 

DIMENSION  THE  FOLLOWING  ARRAY  TO  HOLD  THE  CODE 
DIMENSION  INST( 100000) 

C**«  SOME  EQUIVALENCES  COULD  BE  MADE  BETWEEN  ABOVE  ARRAYS 
READ  (5,10)  N,  NRMAP 
10  FORMAT  (1615) 

NP1  =  N  +  1 

READ  (5,10)  (JA(J) ,J=1 ,NP1) 

NA  =  JA(NPI)  -  1 

READ  (5,10)  ( I A ( J ) , J=1 ,NA) 

IF  (NRMAP  .NE.  0)  READ  (5,10)  (NUMN( J) , J=1 ,NA) 

C*«*  ROW  INDEX  OF  PIVOT  POSITIONS 
READ  (3,10)  ( IPIV (I),I=1,N) 

C*«*  ZERO  LU  STORAGE 

DO  20  I  s  1 ,  10000 
20  A ( I )  =  0. 

C*«*  UNIFORMLY-DISTRIBUTED  NEGATIVE  OFF-DIAGONAL  VALUES 


NNN  =  999 
DO  30  J  *  1,  NA 
JJ  =  J 

IF  (NRMAP  .NE.  0)  JJ  =  NUMN(J) 

30  A( JJ )  =  -URAND(NNN) 

DO  40  J  =  1,  N 
SUMR(J)  =  0. 

SUMC(J)  =  0. 

40  B( J)  =  0 

C»««  FORMULATE  EQUATIONS  SO  SOLUTION  IS  X(I)  =1+1 
DO  60  I  =  1  ,  N 

11  =  JA(I) 

12  =  JA( I  +  1)  -  1 
DO  50  J  =  II,  12 

ICOL  =  IA( J ) 

JJ  =  J 

IF  (NRMAP  .NE.  0)  JJ  =  NUMN(J) 

SUMC(I)  =  SUMC(I)  -  A( JJ ) 

SUMR(ICOL)  =  SUMR(ICOL)  -  A(JJ) 

50  B(ICOL)  s  B(ICOL)  +  A(JJ)  •  (I  +  1) 

60  CONTINUE 


Table  3. 


Example  driver  for  numeric  phazo 


non 


C«««  FIND  PIVOTS  AND  FORCE  DOMINANCE 
DO  80  I  =  1,  N 

11  =  JA{ I ) 

12  =  JA( I  +  1)  -  1 
DO  70  J  =11,  12 

ICOL  =  IA( J) 

JJ  =  J 

IF  (NRMAP  .NE.  0)  JJ  =  NUMN(J) 

IF  (ICOL  .NE.  IPIV(I) )  GO  TO  70 
B(ICOL)  =  B(ICOL)  -  A( JJ )  •  (I  +  1) 

A( J J )  =  .01  +  1.1E0  *  AMAX1 ( SUMC ( I )  +  A( J J ) , SUMR( ICOL)  +  A(JJ 
1  ) 

B(ICOL)  =  B(ICOL)  +  A( J J )  »  (I  +  1) 

GO  TO  80 
70  CONTINUE 
80  CONTINUE 

READ  (8,90)  NINSTW 
90  FORMAT  (16) 

READ  (8)  (INST(I), 1=1, NINSTW) 

C»»*  A  IS  COLUMN-ORDERED  PACKED  MATRIX 
B  IS  RIGHT  HAND  SIDE 
X  IS  SOLUTION 
N  IS  #  OF  EQUATIONS 
CALL  SOLVE(INST,  A,  B,  X,  N) 

WRITE  (6,100)  (X(I), 1=1, N) 

100  FORMAT  (5E12.^) 

STOP 
END 


Table  3. 


Continued 


I 

I 


#  of 

equations 

Description 

No 

Decoupling 

Decoupling 

289 

17x17  5  pt. 

2-D  grid 

52060 

59626 

443 

Elec.  Power  System 

7528 

9394 

450 

Electronic  circuit 
4-bit  adder 

6931 

7122 

507 

Oil  reservoir 

96479 

108478 

2323 

Oil  reservoir 

1 360000 

1407069 

5300 

Elec,  power  system 

465000 

534077 

Table  4.  Floating  point  operation  counts  to  factorize  matrices 


#  of 
equat. 

Description 

Code  stor. 
(64-bit  wds) 

FP 

oper. 

MFL.OPS 

Eff. 

MFLOPS 

Ordering 

Time 

(msic) 

Code 

Gen. 

Solu.  I 

160 

Elec.  Power 
Sys. 

7691 

7945 

15.3 

-- 

90.5 

145 

.520  1 

289 

17x17  5-pt 
grid  nested 
dissect. 

63375 

591 02 

14.5 

12.6 

758 

1250 

4.06 

391 

Oil reserv. 
D-4  ordered 

43096 

46296 

16.5 

-- 

583 

796 

2.79 

443 

Elec.  Power 
Sys. 

14157 

14001 

14.7 

11.7 

311 

250 

.948 

450 

Elec.  cir. 
4-bit  adder 

12791 

12370 

14.3 

13.9 

314 

213 

.864 

507 

Oil reserv. 
D-4  ordered 

113566 

125117 

17.3 

15.4 

1823 

2260 

7.24 

1746 

Elec.  cir. 
16-bit  adder 

45758 

43779 

14.3 

14.2 

3520 

695 

3.06 

5300 

Elec.  Power 
Sys. 

585253 

634837 

11.6 

10.1 

54313 

25700 

58.37 

Table  5.  Result  summary.  Different  operating  systems 
(CCQS  and  GTSS)  were  used  to  solve  different  problems;  unexplained 
variability  was  noted  in  CTSS  timings.  t 
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